On the regularization effect of stochastic gradient descent applied to least-squares
Authors
Abstract
We study the behavior of stochastic gradient descent applied to $\|Ax - b\|_2^2 \rightarrow \min$ for invertible $A \in \mathbb{R}^{n \times n}$. We show that there is an explicit constant $c_{A}$ depending (mildly) on $A$ such that $$ \mathbb{E}\,\left\| Ax_{k+1}-b\right\|^2_{2} \leq \left(1 + \frac{c_{A}}{\|A\|_F^2}\right) \left\|A x_k - b \right\|^2_{2} - \frac{2}{\|A\|_F^2} \left\|A^T A (x_k - x)\right\|^2_{2}.$$ This is a curious inequality: the last term has one more matrix applied to the residual $u_k - u$ than the remaining terms, so if $x_k - x$ is mainly comprised of large singular vectors, this leads to quick regularization. For symmetric matrices, this inequality has an extension to higher-order Sobolev spaces. This explains a (known) regularization phenomenon: an energy cascade from large singular values to small singular values smoothes.
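To make the setting concrete, here is a minimal sketch, not taken from the paper, of the row-sampling SGD variant commonly used for $\|Ax-b\|_2^2$ (a Kaczmarz-type step with row $i$ drawn with probability $\|a_i\|_2^2/\|A\|_F^2$). The final diagnostic illustrates the energy cascade by comparing the error along large versus small singular directions; the dimension, iteration count, and seed are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact setup): row-sampling SGD / randomized
# Kaczmarz for ||Ax - b||_2^2 with an invertible A, tracking how the error
# x_k - x decomposes along singular vectors. The sampling rule p_i ~ ||a_i||^2
# and the diagnostics below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true

U, s, Vt = np.linalg.svd(A)            # singular vectors to monitor the cascade
row_norms2 = np.sum(A**2, axis=1)
probs = row_norms2 / row_norms2.sum()  # sample rows with p_i proportional to ||a_i||^2

x = np.zeros(n)
for k in range(20000):
    i = rng.choice(n, p=probs)
    a_i = A[i]
    # Kaczmarz-type SGD step: project onto the hyperplane <a_i, x> = b_i.
    x += (b[i] - a_i @ x) / row_norms2[i] * a_i

# Energy of the error along the largest vs. smallest singular directions:
err = Vt @ (x - x_true)
print("error along top-5 singular directions   :", np.linalg.norm(err[:5]))
print("error along bottom-5 singular directions:", np.linalg.norm(err[-5:]))
```

In this sketch the components of the error along directions with large singular values shrink much faster than those along small singular values, which is the regularization effect described above.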
Similar resources
The Regularization Effects of Anisotropic Noise in Stochastic Gradient Descent
Understanding the generalization of deep learning has raised lots of concerns recently, where the learning algorithms play an important role in generalization performance, such as stochastic gradient descent (SGD). Along this line, we particularly study the anisotropic noise introduced by SGD, and investigate its importance for the generalization in deep neural networks. Through a thorough empi...
Stochastic Optimization Algorithm Applied to Least Median of Squares Regression
The paper presents a stochastic optimization algorithm for computing least median of squares (LMS) regression, introduced by Rousseeuw and Leroy (1986). As the exact solution is hard to obtain, a random approximation is proposed, which is much cheaper in time and easy to program. A MATLAB program is included.
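The cited paper's specific algorithm (and its MATLAB program) is not reproduced here; the following is a generic, hedged Python sketch of the random-resampling idea behind approximate LMS: fit many random elemental subsets and keep the candidate with the smallest median squared residual. The function name and trial count are illustrative.

```python
# Hedged sketch of a generic random-subsampling approximation to least median
# of squares (LMS) regression; the cited paper's algorithm may differ.
import numpy as np

def lms_random(X, y, n_trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_crit = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(n, size=p, replace=False)   # random elemental subset
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        crit = np.median((y - X @ beta) ** 2)        # LMS objective on all data
        if crit < best_crit:
            best_beta, best_crit = beta, crit
    return best_beta, best_crit
```

A call such as `lms_random(X, y)` returns the best candidate coefficients found and the achieved median squared residual.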
A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares)
This work provides a simplified proof of the statistical minimax optimality of (iterate averaged) stochastic gradient descent (SGD), for the special case of least squares. This result is obtained by analyzing SGD as a stochastic process and by sharply characterizing the stationary covariance matrix of this process. The finite rate optimality characterization captures the constant factors and ad...
Stochastic Proximal Gradient Descent for Nuclear Norm Regularization
In this paper, we utilize stochastic optimization to reduce the space complexity of convex composite optimization with a nuclear norm regularizer, where the variable is a matrix of size m × n. By constructing a low-rank estimate of the gradient, we propose an iterative algorithm based on stochastic proximal gradient descent (SPGD), and take the last iterate of SPGD as the final solution. The ma...
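As a rough illustration only (the low-rank gradient estimator that gives the cited method its space savings is not reproduced), here is a generic stochastic proximal gradient sketch for a nuclear-norm-regularized objective, using singular-value soft-thresholding as the proximal map and returning the last iterate. The matrix-completion instance, step size, and batching are assumptions.

```python
# Generic sketch of stochastic proximal gradient descent with a nuclear-norm
# regularizer, illustrated on noisy matrix completion. The stochastic gradient
# below simply subsamples observed entries; it is not the cited low-rank estimator.
import numpy as np

def svt(Z, tau):
    """Prox of tau*||.||_*: soft-threshold the singular values of Z."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def spgd_nuclear(M_obs, mask, lam=1.0, step=0.5, iters=300, batch=0.2, seed=0):
    rng = np.random.default_rng(seed)
    X = np.zeros_like(M_obs)
    obs = np.argwhere(mask)                       # indices of observed entries
    for _ in range(iters):
        sub = obs[rng.choice(len(obs), size=int(batch * len(obs)), replace=False)]
        G = np.zeros_like(X)
        # Unbiased subsampled gradient of 0.5*||P_Omega(X - M)||_F^2.
        G[sub[:, 0], sub[:, 1]] = (X - M_obs)[sub[:, 0], sub[:, 1]] / batch
        X = svt(X - step * G, step * lam)         # proximal (soft-thresholding) step
    return X                                      # last iterate, as in the cited analysis
```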
Iterate averaging as regularization for stochastic gradient descent
We propose and analyze a variant of the classic Polyak-Ruppert averaging scheme, broadly used in stochastic gradient methods. Rather than a uniform average of the iterates, we consider a weighted average, with weights decaying in a geometric fashion. In the context of linear least squares regression, we show that this averaging scheme has the same regularizing effect, and indeed is asymptotic...
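A hedged sketch of the idea, assuming a simple exponential-moving-average implementation of geometrically decaying weights on top of plain SGD for least squares; the decay rate, step size, and problem instance are illustrative choices, not the parameters analyzed in the cited paper.

```python
# Minimal sketch: single-sample SGD on least squares with a geometrically
# weighted (exponential moving) average of the iterates. Parameters are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 20
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.5 * rng.standard_normal(n)

step, gamma = 0.01, 0.05
w = np.zeros(d)
w_avg = np.zeros(d)                           # geometrically weighted average
for k in range(50000):
    i = rng.integers(n)
    w -= step * (X[i] @ w - y[i]) * X[i]      # single-sample SGD step
    # Weight on iterate x_j decays like (1 - gamma)^(k - j):
    w_avg = (1 - gamma) * w_avg + gamma * w

print("last iterate error :", np.linalg.norm(w - w_true))
print("averaged error     :", np.linalg.norm(w_avg - w_true))
```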
Journal
Journal title: Electronic Transactions on Numerical Analysis
Year: 2021
ISSN: 1068-9613, 1097-4067
DOI: https://doi.org/10.1553/etna_vol54s610